INN Hotels Project

Context

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include changes of plans and scheduling conflicts, and cancelling is often made easier by the option to do so free of charge or at a low cost. This flexibility benefits hotel guests, but it is a less desirable and potentially revenue-diminishing factor for hotels. The losses are particularly high for last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts:

Objective

The increasing number of cancellations calls for a Machine Learning based solution that can help in predicting which booking is likely to be canceled. INN Hotels Group has a chain of hotels in Portugal, they are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. You as a data scientist have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict which booking is going to be canceled in advance, and help in formulating profitable policies for cancellations and refunds.

Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Key Questions to Be Answered

Data Dictionary

Importing necessary libraries and data

Data Overview

Load and overview the dataset

Feature Creation

Two columns have a lot of zeros that otherwise provide valuable information: Median Fill
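A minimal sketch of the median-fill idea: replace zeros with the median of the nonzero values so the zeros stop distorting the distribution. The helper name and the demo column (`avg_price_per_room`) are illustrative, not necessarily the two columns treated in the notebook.

```python
import pandas as pd

def median_fill_zeros(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Replace zeros in the given columns with the median of the nonzero values."""
    df = df.copy()
    for col in cols:
        nonzero_median = df.loc[df[col] != 0, col].median()
        df[col] = df[col].replace(0, nonzero_median)
    return df

# illustrative column name; the notebook applies this to its two zero-heavy columns
demo = pd.DataFrame({"avg_price_per_room": [0.0, 100.0, 120.0, 0.0, 80.0]})
demo = median_fill_zeros(demo, ["avg_price_per_room"])
```

The median is computed on the nonzero values only; computing it over the whole column would let the zeros drag the fill value down.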

No Missing Data?

Hotel Cancellation Ratio

Let's experiment with some new graphs we haven't learned!

Exploratory Data Analysis (EDA)

Questions:

  1. What are the busiest months in the hotel? October
  2. Which market segment do most of the guests come from? Online
  3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments? Corporate pays the least and Online the most. Average price per room by segment: Aviation 100.70, Complementary 93.68, Corporate 82.91, Offline 91.63, Online 113.08
  4. What percentage of bookings are canceled? Number of canceled reservations: 11878 (32.78%)
  5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel? 15 / 926 ≈ 1.6%
  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation? Yes
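The cancellation percentage in question 4 can be reproduced with a simple count on the status column; the toy frame below stands in for the real dataset.

```python
import pandas as pd

# stand-in for the real booking data: 3 of 10 reservations canceled
bookings = pd.DataFrame({"booking_status": ["Canceled"] * 3 + ["Not_Canceled"] * 7})

n_canceled = (bookings["booking_status"] == "Canceled").sum()
pct_canceled = n_canceled / len(bookings) * 100
print(f"Number of canceled reservations: {n_canceled} ({pct_canceled:.2f}%)")
```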

Univariate analysis

There are 2 adults on average for each reservation.

Most reservations are booked fewer than 100 days out; however, quite a few are booked more than 200 days in advance.

Many reservations include 1 special request.

Most reservations are not repeat customers.

Starting in August, the hotels are busy; October is the highest-volume month.

Most reservations do not have any special requests.

67% of reservations are not canceled.

About two-thirds of all reservations are made Online.

Most people have breakfast, the most important meal of the day, included in their reservation.

In 99% of reservations, the person making the reservation has not canceled in the past.

A higher share of reservations that include weekend nights get canceled.

Reservations that don't require a parking spot get canceled more often. Guests who plan their drive time may plan further ahead and be less likely to cancel.

Canceled reservations have fewer special requests than non-canceled ones, which may indicate more advance planning by guests who keep their bookings.

The cancellation rate rises as the weeks of the year progress, especially around week 40.


Room Type 6 has the highest cancellation rate.

Online reservations are canceled most frequently, followed by Offline and Aviation, which are roughly equal.

Reservations with no special requests get canceled at almost twice the rate of all other groups.

Guests with 13 prior cancellations always cancel; among reservations with no prior-cancellation signal, about 35% are canceled.

Another view of the correlation matrix with a different color scheme to highlight the variables that are correlated.

NOTE: we will have to drop either month or week_of_year because they are almost perfectly correlated.
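Dropping one column of a near-perfectly correlated pair can be sketched like this; the toy frame below stands in for the full dataset, where `week_of_year` tracks `month` almost exactly.

```python
import pandas as pd

# toy frame: week_of_year tracks month almost exactly, lead_time does not
df = pd.DataFrame({
    "month":        [1, 1, 2, 3, 3, 4],
    "week_of_year": [2, 3, 6, 10, 11, 14],
    "lead_time":    [5, 40, 12, 90, 7, 30],
})

corr = df["month"].corr(df["week_of_year"])
if corr > 0.9:  # near-perfect correlation -> keep only one of the pair
    df = df.drop(columns=["week_of_year"])
```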

Data Preprocessing 1 - Basic Cleaning After Initial Data Inspection

EDA Post Data Processing First Pass

Bivariate Analysis

Checking Multicollinearity
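One standard multicollinearity check is the variance inflation factor, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the others. The self-contained sketch below computes it with plain NumPy (statsmodels' `variance_inflation_factor` does the same); the toy frame's `month`/`week_of_year` pair mimics the collinearity flagged above.

```python
import numpy as np
import pandas as pd

def vif_table(X: pd.DataFrame) -> pd.Series:
    """Variance inflation factor per predictor: VIF_j = 1 / (1 - R_j^2)."""
    vifs = {}
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        A = np.column_stack([np.ones(len(X)), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - ((y - A @ coef) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        vifs[col] = 1.0 / max(1.0 - r2, 1e-12)
    return pd.Series(vifs, name="VIF")

# month and week_of_year are near-duplicates, so their VIFs explode
demo = pd.DataFrame({
    "month": np.arange(10, dtype=float),
    "week_of_year": 4 * np.arange(10) + np.tile([0, 1], 5),
    "lead_time": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],
})
vifs = vif_table(demo)
```

A common rule of thumb is to investigate any predictor with a VIF above about 5 (or 10).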

Functions to treat outliers by flooring and capping
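A typical flooring-and-capping helper uses the 1.5 x IQR whisker rule. The sketch below (helper name illustrative) shows the usual shape of such a function, though per the project guidance quoted later in this notebook, outliers were ultimately left untreated to avoid singular-matrix errors.

```python
import pandas as pd

def treat_outliers(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Floor at Q1 - 1.5*IQR and cap at Q3 + 1.5*IQR (the boxplot whisker rule)."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
    return df

# the extreme 500 gets capped at the upper whisker
demo = pd.DataFrame({"lead_time": [1, 2, 3, 4, 5, 500]})
demo = treat_outliers(demo, "lead_time")
```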

FINAL EDA Summary

Insight: after all data preprocessing, this is what we end up with.

Basic Info
- Number of variables: 18
- Number of observations: 36238
- Missing cells: 0 (0.0%)
- Duplicate rows: 0 (0.0%)
- Total size in memory: 5.0 MiB
- Average record size in memory: 144.0 B
- Numeric: 11
- Categorical: 7

Summary of correlations and other facts:
1. month is highly correlated with week_of_year (and vice versa; the profiler flags this pair repeatedly)
2. booking_status is highly correlated with lead_time
3. no_of_previous_bookings_not_canceled is highly correlated with repeated_guest and 1 other field
4. no_of_week_nights is highly correlated with no_of_weekend_nights
5. no_of_children is highly correlated with room_type_reserved
6. no_of_previous_cancellations is highly correlated with no_of_previous_bookings_not_canceled
7. no_of_previous_cancellations is highly skewed (γ1 = 25.19)

We will drop either month or week_of_year.

These columns have valid zeros and should not be treated:
- no_of_children: 33544 (92.6%) zeros
- no_of_weekend_nights: 16872 (46.6%) zeros
- no_of_week_nights: 2383 (6.6%) zeros
- no_of_previous_cancellations: 35901 (99.1%) zeros
- no_of_previous_bookings_not_canceled: 35429 (97.8%) zeros
- no_of_special_requests: 19751 (54.5%) zeros

MODEL BUILDING - LOGISTIC REGRESSION & DECISION TREE APPROACHES

Logistic Regression Section

Data Preparation

Per Program Office Academic Support 17 "Sometimes, there exists some meaningful information in the outliers as well. For this project, it is advised not to treat outliers to avoid singular matrix error."

# Series.replace matches whole cell values, so str.replace is needed to strip the "Meal Plan " prefix
data["type_of_meal_plan"] = data["type_of_meal_plan"].str.replace("Meal Plan ", "")
data["type_of_meal_plan"] = data["type_of_meal_plan"].replace("Not Selected", 0)
data["type_of_meal_plan"] = data["type_of_meal_plan"].astype(int)
data

Logistic Regression (with Sklearn library)

Checking performance on training set

Accuracy is decent at ~0.80, but recall is only ~0.62, so the model misses many actual cancellations; the gap between accuracy and recall reflects class imbalance rather than overfitting. More work can be done.

Checking performance on test set

The test results are almost the same: accuracy ~0.80 and recall ~0.64. Similar train and test scores suggest the model is not overfitting, but the low recall leaves room for improvement.

Logistic Regression (with statsmodels library)

It ran! After all of those 'singular matrix' errors, I simplified the model and rebuilt it from the beginning, and the work was worth it. Based on the initial results, there are some p-values > 0.05 that will need closer inspection.

Observations

Additional Information on VIF

Now no feature has a p-value greater than 0.05, so we'll consider the features in X_train3 as final and lg3 as the final model.

Coefficient interpretations

Converting coefficients to odds
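For a logit model, exponentiating a coefficient gives the multiplicative change in the odds of cancellation per one-unit increase in that predictor. The coefficient values below are illustrative, not the actual fitted `lg3` values.

```python
import numpy as np
import pandas as pd

# illustrative coefficients (not the fitted lg3 values)
coefs = pd.Series({"lead_time": 0.015, "no_of_special_requests": -1.47})

odds = np.exp(coefs)           # multiplicative change in odds per unit increase
pct_change = (odds - 1) * 100  # same thing expressed as a percent change in odds
```

For example, a positive lead_time coefficient means each extra day of lead time multiplies the odds of cancellation by exp(coef) > 1, while a negative special-requests coefficient shrinks the odds.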

Coefficient interpretations

Interpretation for other attributes can be done similarly.

Checking model performance on the training set

ROC-AUC

Model Performance Improvement

Optimal threshold using AUC-ROC curve

Checking model performance on training set

Let's use Precision-Recall curve and see if we can find a better threshold
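Finding a better threshold amounts to scanning candidate thresholds and picking the one that maximizes F1 (the precision-recall balance). This self-contained sketch uses synthetic scores standing in for `lg3.predict_proba(X_train)[:, 1]`.

```python
import numpy as np

# synthetic labels and predicted probabilities, standing in for the real model output
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1, 0, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.45, 0.9])

def f1_at(threshold):
    """F1 score when predicting 'canceled' for scores at or above the threshold."""
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

thresholds = np.unique(y_score)          # every observed score is a candidate cut
best = max(thresholds, key=f1_at)        # threshold with the highest F1
```

sklearn's `precision_recall_curve` gives the same precision/recall pairs per threshold; the manual loop just makes the selection criterion explicit.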

Checking model performance on training set

Model Performance Summary

Let's check the performance on the test set

Dropping the columns from the test set that were dropped from the training set

Using model with default threshold

Using model with threshold=0.33

Using model with threshold = 0.42

Model performance summary

Training performance comparison

models_train_comp_df = pd.concat(
    [
        log_reg_model_train_perf.T,
        log_reg_model_train_perf_threshold_auc_roc.T,
        log_reg_model_train_perf_threshold_curve.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression sklearn",
    "Logistic Regression-0.35 Threshold",
    "Logistic Regression-0.42 Threshold",
]

print("Training performance comparison:")
models_train_comp_df

Conclusion

Recommendations

To decrease the likelihood of cancellations, build pricing and programs around:

Online booking is barrier-free, and most of the cancellations come from that segment:

===========================================

Building a Decision Tree model

Drop "Meal Plan" & "Room Type" and just keep the order number

data2["type_of_meal_plan"] = data2["type_of_meal_plan"].str.replace("Meal Plan", "")
data2["type_of_meal_plan"] = data2["type_of_meal_plan"].replace("Not Selected", 0)

data2['room_type_reserved'] = data2['room_type_reserved'].str.replace('Room_Type','')

Encoding booking_status: replacing "Not_Canceled" and "Canceled" with 0 and 1 to match the required_car_parking_space convention (Does the customer require a car parking space? 0 - No, 1 - Yes).

data2["booking_status"] = data2["booking_status"].replace("Not_Canceled", 0)
data2["booking_status"] = data2["booking_status"].replace("Canceled", 1)

Change market_segment_type to numbers: 0 (Online), 1 (Offline), 2 (Corporate), 3 (Complementary), 4 (Aviation)

data2["market_segment_type"] = data2["market_segment_type"].replace(
    {"Online": 0, "Offline": 1, "Corporate": 2, "Complementary": 3, "Aviation": 4}
)
data2.market_segment_type.value_counts()

Creating dummy variables

dummy_data = pd.get_dummies(
    data2,
    columns=[
        "type_of_meal_plan",
        "room_type_reserved",
    ],
    drop_first=True,
)
dummy_data.head()

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.
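A sketch of such a helper (the name and return format are illustrative; the notebook's version may differ): it derives accuracy, recall, precision, and F1 from the four confusion-matrix cells.

```python
import numpy as np
import pandas as pd

def model_performance(y_true, y_pred) -> pd.DataFrame:
    """Accuracy, recall, precision and F1 computed from the confusion matrix."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return pd.DataFrame(
        {"Accuracy": [accuracy], "Recall": [recall], "Precision": [precision], "F1": [f1]}
    )

# tiny demo: one true cancellation caught, one missed
perf = model_performance([1, 0, 1, 0], [1, 0, 0, 0])
```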

Build Decision Tree Model

Checking model performance on training set

Checking model performance on test set

Visualizing the Decision Tree

plt.figure(figsize=(20, 30))
out = tree.plot_tree(
    model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)

The code below adds arrows to the decision tree splits if they are missing

for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()

This is impossible to read as-is; the tree needs pruning.

Reducing overfitting

Using GridSearch for Hyperparameter tuning of our tree model
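A hedged sketch of the grid search: the parameter grid is illustrative and synthetic data stands in for the training set; `scoring="recall"` reflects that missing a real cancellation is the costly error for the hotel.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for X_train / y_train
rng = np.random.RandomState(1)
X = rng.rand(200, 4)
y = (X[:, 0] + 0.1 * rng.rand(200) > 0.55).astype(int)

# illustrative grid; the notebook's grid may differ
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [5, 10, 20],
    "max_leaf_nodes": [10, 20, None],
}

# optimize for recall: failing to flag a real cancellation is the costly error
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="recall",
    cv=5,
    n_jobs=-1,
)
grid.fit(X, y)
tuned_tree = grid.best_estimator_
```

`best_estimator_` is refit on the full training data with the winning hyperparameters, so it can be evaluated on the test set directly.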

Checking performance on training set

Checking performance on test set

Visualizing the Decision Tree

Observations from the tree:

Using the above extracted decision rules we can make interpretations from the decision tree model like:

|--- lead_time <= 90.50
|   |--- no_of_special_requests <= 1.50
|   |   |--- market_segment_type <= 0.50
|   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |--- lead_time <= 8.50
|   |   |   |   |   |--- weights: [91.80, 162.35] class: 1

|   |   |   |--- no_of_weekend_nights > 0.50
|   |   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |   |--- weights: [212.85, 221.00] class: 1

You can keep reading the tree to find similar classes; interpretations from other decision rules can be made similarly.

Cost Complexity Pruning

The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.

Total impurity of leaves vs effective alphas of pruned tree

Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.
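The pruning-path workflow described above can be sketched end to end; synthetic data stands in for the training set here, but the calls are the same ones the notebook uses.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the training data; the diagonal boundary needs several splits
rng = np.random.RandomState(0)
X = rng.rand(300, 3)
y = (X[:, 0] + 0.3 * X[:, 1] > 0.65).astype(int)

clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X, y)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# fit one tree per effective alpha; the last alpha prunes everything down to the root
clfs = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y)
    for a in ccp_alphas
]
node_counts = [c.tree_.node_count for c in clfs]
```

Plotting `node_counts` (or recall) against `ccp_alphas` is how the notebook chooses an alpha that keeps the business rules while still pruning aggressively.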

Maximum recall occurs at alpha = 0.025, but choosing it would leave the decision tree with only a root node and we would lose the business rules; instead we can choose alpha of about 0.002 to 0.003, retaining information while still getting higher recall.

Checking performance on training set

Checking performance on test set

Visualizing the Decision Tree

Creating model with ccp_alpha < 0.005

Checking performance on the training set

Checking performance on the test set

Visualizing the Decision Tree

Comparing all the decision tree models

Conclusions

Recommendations

While decision trees allow us to run even more granular A/B tests than the logistic model, the basic principles are the same:

To decrease the likelihood of cancellations, build pricing and programs around:

Online booking is barrier-free, and most of the cancellations come from that segment:

EXAMPLES - each of these likely-to-cancel buckets of guests could be targeted through A/B tests at various fee levels (incentives):

  1. Implement short-term cancellation fees for any reservation made 90 or fewer days in advance.
  2. For all Corporate reservations that include a weekend night, implement a cancellation fee.
  3. For Online reservations with lead_time <= 8.5 days, implement a non-refundable "short term" premium fee equal to 40% of the total stay and/or refund only 60%.
  4. For rooms priced above $200, consider increasing the cancellation fee.

Comparing Performance of Both Logistic Regression & Decision Trees

While both models produce usable results, comparing the logistic regression implementations (statsmodels & sklearn) against the trees yields the following observations.

Logistic

  1. Logistic regression requires much more effort in terms of data cleaning and inspection, which can be time-consuming and error-prone.
  2. Logistic regression runs much faster in the current environment.
  3. Logistic regression is easier to interpret: the sign of each coefficient shows its positive or negative impact on the dependent variable.
  4. Its performance was acceptable; adjusting the threshold balanced recall and precision, resulting in an F1 score of 0.69.
  5. The main gap is that quite a bit of explainability still cannot be uncovered with the data we have.

Decision Tree

  1. The results initially "appear" better, but that is a red herring because the model is overfitting on the training data.
  2. Hyperparameter tuning yields very good results on both train and test, though the model may still be slightly overfitting.
  3. The post-tuned tree is the model I would select: it performs better than the logistic model in closing the recall gap, and better than the other decision tree variants.
  4. The only drawback is that it is computationally intensive, and the graphical output is harder to understand than coefficients, which may come down to personal comfort with the latter.

What also needs to be considered is the production environment. If it is stable, meaning the data comes in with known defects and remains stable, the logistic model could possibly be used as a second check, where the delta between the two models becomes a key metric or warning signal if something changes.

As for the decision tree, the environment would need to be sized accordingly to ensure performance isn't an issue. If running the decision tree in real time is not feasible for some reason, use the logistic model to screen the "easy" decisions and run the decision tree as a second pass against bookings that are not clearly candidates for approval.